{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cd13460c",
   "metadata": {},
   "source": [
    "# Domain Adaptive Pre-Training (DAPT)\n",
    "\n",
    "## Goal\n",
    "\n",
    "Given a foundational language model (in this case llama-2-7B) that was pre-trained on a broad, general-purpose corpus, our goal is to further pretrain the model on a specific domain (in this example, ChipDesign) to enhance its understanding of domain-specific language and context. This process is called Domain-Adaptive Pretraining (DAPT). DAPT adapts a general-purpose model to specialized tasks within a particular field. Instead of training from scratch, we aim to “specialize” the model by focusing on a target domain corpus, allowing it to adapt to the unique vocabulary, semantics, and syntax of that field.\n",
    "\n",
    "Our primary goals with respect to DAPT are as follows:\n",
    "* Improve the model’s performance and accuracy on domain-specific tasks\n",
    "* Ensure the model retains general language capabilities\n",
    "* Minimize pretraining time by leveraging existing knowledge in the model\n",
    "\n",
    "DAPT typically enhances a model’s efficacy in downstream tasks for the domain by exposing it to domain-relevant texts. This pretraining phase can result in more accurate and context-aware predictions on domain-specific data, as the model gains an understanding of field-specific terminology, abbreviations, and common phrases."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c43ef563",
   "metadata": {},
   "source": [
    "# NeMo Tools and Resources\n",
    "\n",
    "* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bea0b51f",
   "metadata": {},
   "source": [
    "# Software Requirements\n",
    "* Access to latest NeMo Framework NGC Containers\n",
    "* This playbook has been tested on: nvcr.io/nvidia/nemo:dev. It is expected to work similarly on other environments.\n",
    "\n",
    "\n",
    "#### Launch the NeMo Framework container as follows: \n",
    "\n",
    "```\n",
    "docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '\"device=0,1\"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:dev\n",
    "```\n",
    "\n",
    "#### Launch Jupyter Notebook as follows: \n",
    "```\n",
    "jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''\n",
    "\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7137e1db",
   "metadata": {},
   "source": [
    "# Hardware Requirements\n",
    "\n",
    "* This playbook has been tested on 2xA100 80G but can be scaled to multiple GPUs as well as multiple nodes by modifying the appropriate parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91ecb0d3",
   "metadata": {},
   "source": [
    "# Data\n",
    "\n",
    "* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA-NeMo/Curator/tree/dask) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv). Please note that this tutorial uses NeMo Curator version 0.9.0 or lower."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba16a72b",
   "metadata": {},
   "source": [
    "# Notebook Outline\n",
    "\n",
    "* Step 1: Prepare the data for pretraining. This is a multi-step process discussed in detail later in the specific section (later in the notebook).\n",
    "\n",
    "* Step 2: Download the llama-2-7B hugging face checkpoint and convert to .nemo format.\n",
    "\n",
    "* Step 3: Continued pretraining the llama-2-7b model using the prepared data and the custom trained tokenizer (from the previous notebook).\n",
    "\n",
    "* Step 4: Generate Results from llama-2-7b model and trained DAPT Checkpoints \n",
    "\n",
    "* Step 5: Calculate evaluation metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "115e8b1f",
   "metadata": {},
   "source": [
    "# Step 0: Clone the Model Checkpoint\n",
    "\n",
    "This notebook assumed the model has been cloned from [hugging face](https://huggingface.co/meta-llama/Llama-2-7b-hf) in the mounted directory ```/dli/task/02_custom_tokenizer_training/models/weight/```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bc658bd",
   "metadata": {},
   "source": [
    "Clone the model: \n",
    "```\n",
    "git lfs install\n",
    "git clone https://huggingface.co/meta-llama/Llama-2-7b-hf\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec372453",
   "metadata": {},
   "source": [
    "# Step 1: Data Preparation for Pretraining\n",
    "\n",
    "Identify the different file types (example: code, text, etc) in the pretraining data, in this case we only have 'code' type files. This is typically dataset dependent. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e874263-d3ef-41cc-a0ef-3743058e00da",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "from pathlib import Path\n",
    "\n",
    "# Function to count and categorize JSONL files based on 'file_type' field\n",
    "def identify_jsonl_files(data_path):\n",
    "    code_files, text_files = [], []\n",
    "    cnt_code, cnt_text = 0, 0\n",
    "\n",
    "    for root, _, files in os.walk(data_path):\n",
    "        for file in files:\n",
    "            if not file.endswith('.jsonl'):\n",
    "                continue\n",
    "            \n",
    "            file_path = os.path.join(root, file)\n",
    "            has_code, has_text = False, False\n",
    "\n",
    "            with open(file_path, 'r') as f:\n",
    "                for line in f:\n",
    "                    try:\n",
    "                        json_obj = json.loads(line.strip())\n",
    "                        file_type = json_obj.get('file_type', '').lower()\n",
    "\n",
    "                        if file_type == 'code':\n",
    "                            has_code = True\n",
    "                        elif file_type == 'text':\n",
    "                            has_text = True\n",
    "                        \n",
    "                        if has_code and has_text:\n",
    "                            break  # No need to read further if both types are present\n",
    "\n",
    "                    except json.JSONDecodeError:\n",
    "                        continue  # Ignore malformed JSON lines\n",
    "\n",
    "            if has_code:\n",
    "                code_files.append(file_path)\n",
    "                cnt_code += 1\n",
    "            if has_text:\n",
    "                text_files.append(file_path)\n",
    "                cnt_text += 1\n",
    "\n",
    "    return code_files, text_files, cnt_code, cnt_text\n",
    "\n",
    "# Path to JSONL dataset\n",
    "data_path = '/dli/task/02_custom_tokenizer_training/curated_sample_data/curated_data'\n",
    "\n",
    "# Identify JSONL files\n",
    "code_files, text_files, cnt_code, cnt_text = identify_jsonl_files(data_path)\n",
    "\n",
    "# Output results\n",
    "print(f\"\\nNumber of files containing 'file_type': 'text': {cnt_text}\")\n",
    "print(f\"Number of files containing 'file_type': 'code': {cnt_code}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60987ff2",
   "metadata": {},
   "source": [
    "### Merging code JSONL files into a single JSONL file for further preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c02f2e6f",
   "metadata": {},
   "source": [
    "This is an optional step, it is possible to use multiple jsonl files in this workflow as well. This example uses a single merged. jsonl file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "892f4493",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "\n",
    "def list_jsonl_files(directory):\n",
    "    jsonl_files = []\n",
    "    for root, _, files in os.walk(directory):\n",
    "        for file in files:\n",
    "            if file.endswith('.jsonl'):\n",
    "                jsonl_files.append(os.path.join(root, file))\n",
    "    return jsonl_files\n",
    "\n",
    "# Function to merge multiple jsonl files into a single file \n",
    "def merge_jsonl_files(directory, output_file):\n",
    "    jsonl_files = list_jsonl_files(directory)\n",
    "    \n",
    "    with open(output_file, 'w') as outfile:\n",
    "        for input_file in jsonl_files:\n",
    "            with open(input_file, 'r') as infile:\n",
    "                for line in infile:\n",
    "                    try:\n",
    "                        json_object = json.loads(line.strip())\n",
    "                        json.dump(json_object, outfile)\n",
    "                        outfile.write('\\n')\n",
    "                    except json.JSONDecodeError:\n",
    "                        print(f\"Skipping invalid JSON in {input_file}: {line.strip()}\")\n",
    "\n",
    "    print(f\"Merged {len(jsonl_files)} JSONL files into {output_file}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9bb0c80a",
   "metadata": {},
   "outputs": [],
   "source": [
    "directory = '/dli/task/02_custom_tokenizer_training/curated_sample_data/curated_data'\n",
    "output_file = '/dli/task/02_custom_tokenizer_training/curated_sample_data/code_merged_output.jsonl'\n",
    "merge_jsonl_files(directory, output_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d00ad63",
   "metadata": {},
   "source": [
    "### Data Format Conversion for pretraining: JSONL to bin/idx files \n",
    "\n",
    "For efficient pretraining, we convert data from JSONL to bin/idx format. \n",
    "\n",
    "JSONL files, while convenient for storing structured text data, are not optimized for high-speed data loading during large language model training. In pretraining workflows, particularly those with large datasets and complex model architectures, the need for fast data access and efficient memory management is essential.\n",
    "\n",
    "The bin/idx format is a binary format specifically designed to facilitate high-throughput data loading. This format allows direct, randomized access to data samples, which speeds up I/O operations and reduces the memory footprint compared to loading JSONL files. By converting data to bin/idx format, hardware utilization can be maximized and bottlenecks in data processing can be avoided, leading to a more efficient pretraining process.\n",
    "\n",
    "#### Benefits of bin/idx format for Pretraining:\n",
    "\n",
    "* **Optimized I/O Performance:** The binary format enables quicker data reads and reduces latency, allowing the model to continuously access data at high speeds.\n",
    "* **Efficient Memory Usage:** Data in bin/idx format consumes less memory during loading, making it suitable for large datasets and enabling better use of available system resources.\n",
    "* **Enhanced Scalability:** With bin/idx, it’s easier to handle shuffling and batching of large datasets, which is essential for pretraining on diverse domain-specific data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de696d7b",
   "metadata": {},
   "source": [
    "Modify the `input` to point to the merged `jsonl` file. Similarly modify paths to `vocab`, `tokenizer-model`, `merge-file` to point to relevant file paths. `tokenizer-model` should point to the custom tokenizer (trained in the custom tokenizer training notebook) if your data has domain specific terminology"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89b9583d-1dac-4717-b028-c78d0d703f45",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using default Llama-2 tokenizer for testing purpose\n",
    "!python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \\\n",
    "--input='/dli/task/02_custom_tokenizer_training/curated_sample_data/code_merged_output.jsonl' \\\n",
    "--json-keys=text \\\n",
    "--tokenizer-library=sentencepiece \\\n",
    "--dataset-impl mmap \\\n",
    "--tokenizer-model '/dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf/tokenizer.model' \\\n",
    "--tokenizer-type llama \\\n",
    "--append-eod \\\n",
    "--output-prefix='preprocessed_data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcbf66a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "#### Uncomment to use custom trained tokenizer ####\n",
    "# !python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \\\n",
    "# --input='/workspace/dapt-custom-tokenization/code_merged_output.jsonl' \\\n",
    "# --json-keys=text \\\n",
    "# --tokenizer-library=sentencepiece \\\n",
    "# --vocab '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/vocab.json' \\\n",
    "# --dataset-impl mmap \\\n",
    "# --tokenizer-model '/workspace/Llama-2-7b-hf/tokenizer.model' \\\n",
    "# --tokenizer-type llama \\\n",
    "# --merge-file '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/merges.txt' \\\n",
    "# --append-eod \\\n",
    "# --output-prefix='preprocessed_data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f05efa5",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# If the above step runs successfully, two files with the extensions .bin and .idx will be generated\n",
    "!ls "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82f95149",
   "metadata": {},
   "source": [
    "# Step 2: Convert Llama-2-7b Hugging Face Checkpoint to NeMo2.0 format\n",
    "\n",
    "Llama-2-7B model can be automatically downloaded and converted to NeMo2 format with the following script:\n",
    "\n",
    "* Save the following code snippet as ```convert_ckpt_nemo2.py```\n",
    "* Run ```python3 convert_ckpt_nemo2.py```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3260b62-c179-4bc6-b256-729ff6403fa4",
   "metadata": {},
   "source": [
    "```\n",
    "from nemo.collections import llm\n",
    "from nemo.collections.llm import Llama2Config7B\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    output = llm.import_ckpt(\n",
    "        model=llm.LlamaModel(config=Llama2Config7B()),\n",
    "        source=\"hf:///workspace/Llama-2-7b-hf\",\n",
    "    )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b94e774b",
   "metadata": {},
   "source": [
    "The conversion will generate a ```llama-2``` NeMo2 checkpoint directory which can be used for the continued pretraining using NeMo Toolkit as shown in Step 3 in default ```$NEMO_HOME``` folder, unless otherwise specified ```NEMO_HOME``` is set as ```/root/.cache/nemo```\n",
    "\n",
    "Alternatively, you can directly use ```source=\"meta-llama/Llama2-7b-hf\"``` to use the model directly from Hugging Face instead of using the locally downloaded version in ```\\workspace```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b887cf2-cf65-46c5-b174-6ae18952d7e3",
   "metadata": {},
   "source": [
    "```\n",
    "cd /dli/task/02_custom_tokenizer_training/models/weight\n",
    "python3 /dli/task/03_domain_adaptive_pretraining/convert_ckpt_nemo2.py\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c689e584",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls /root/.cache/nemo/models/llama2-7b-hf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68a99de3-2061-4b4b-bd3f-9cfd2ce80b0b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls /root/.cache/nemo/models/llama2-7b-hf/weights/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe1bdfe0",
   "metadata": {},
   "source": [
    "# Step 3: Continued Pretraining using Llama2-7b\n",
    "\n",
    "For this step we use a predefined recipe `llama2_7b.pretrain_recipe` from NeMo Toolkit for continued pretraining. We will modify the `pretrain_recipe` and use it for continued pretraining workflow. Typically this involves changing dataset files and data blends, changing learning rate scheduler, changing default parallelism based on number of devices available, adding connector to resume training, etc.\n",
    "\n",
    "First, we define the recipe and executor for using NeMo2 as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a40f547",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nemo_run as run\n",
    "from nemo.collections import llm\n",
    "from nemo.collections.llm import Llama2Config7B\n",
    "\n",
    "# Configure recipe to pre-train based on the default Llama-2-7B recipe\n",
    "def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):\n",
    "    recipe = llm.llama2_7b.pretrain_recipe(\n",
    "        name=\"llama2_7b_dapt\",\n",
    "        num_nodes=nodes,\n",
    "        num_gpus_per_node=gpus_per_node,\n",
    "    )\n",
    "\n",
    "    # Set parallelism and validation parameters\n",
    "    strategy = recipe.trainer.strategy\n",
    "    strategy.context_parallel_size = 1\n",
    "    strategy.tensor_model_parallel_size = 1\n",
    "    recipe.trainer.val_check_interval = 10\n",
    "\n",
    "    return recipe\n",
    "\n",
    "# Executor for running pretraining \n",
    "def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:\n",
    "    executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\")\n",
    "    return executor"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "464d303fc973333d",
   "metadata": {},
   "source": [
    "Let's instantiate the `recipe` and modify it so that it uses the desired number of GPUs, resuming from the pretrained Llama2-7b checkpoint instead of training from scratch.\n",
    "\n",
    "The default `recipe` initializes all the essential components required for Llama2 7B pretraining, including model, dataloader, trainer, logger, optimizer etc. `recipe` is not executed during instantiation, so it is very simple to modify it to fit your custom training workflow. In our case, we want to do the DAPT (instead of pretraining from scratch), and all we need to do is to add a `resume` config which points to the Llama2 7B checkpoint.\n",
    "\n",
    "You can easily change the optimizer, parallelism, data as per your use case. Look at the following example for guidance on how to tweak these parameters. Note: you are only configuring your task at this stage; the underlying code is not executed unless you launch the job using the executor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d440bfa-8941-439e-aa52-ab6bef0a56f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -r /dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf/.cache"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed9f2569-a4dc-465c-9626-98e278d67733",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nemo.lightning as nl\n",
    "from nemo.collections.common.tokenizers import AutoTokenizer\n",
    "\n",
    "# Define dataset configuration\n",
    "data = run.Config(\n",
    "    llm.PreTrainingDataModule,\n",
    "    paths=['/dli/task/03_domain_adaptive_pretraining/preprocessed_data_text_document'],\n",
    "    seq_length=4096,\n",
    "    tokenizer=run.Config(\n",
    "        AutoTokenizer,\n",
    "        pretrained_model_name=\"/dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf\",\n",
    "    ),\n",
    "    micro_batch_size=1,\n",
    "    global_batch_size=8,\n",
    ")\n",
    "\n",
    "# Instantiate the recipe\n",
    "recipe = configure_recipe(nodes=1, gpus_per_node=2)\n",
    "\n",
    "# Configure resume settings\n",
    "recipe.resume = run.Config(\n",
    "    nl.AutoResume,\n",
    "    restore_config=run.Config(nl.RestoreConfig, path=\"/root/.cache/nemo/models/llama2-7b-hf\"),\n",
    ")\n",
    "\n",
    "# Ensure tokenizer is set\n",
    "recipe.data.tokenizer = data.tokenizer\n",
    "\n",
    "# Configure parallelism settings\n",
    "recipe.trainer.strategy.tensor_model_parallel_size = 2\n",
    "recipe.trainer.strategy.pipeline_model_parallel_size = 1\n",
    "recipe.trainer.strategy.context_parallel_size = 1\n",
    "\n",
    "# Configure training steps and validation intervals\n",
    "recipe.trainer.max_steps = 20\n",
    "recipe.trainer.max_epochs = 1\n",
    "recipe.trainer.val_check_interval = 10\n",
    "recipe.trainer.limit_val_batches=5\n",
    "\n",
    "# Set batch size settings\n",
    "recipe.data.global_batch_size = data.global_batch_size\n",
    "recipe.data.micro_batch_size = data.micro_batch_size\n",
    "recipe.data.num_val_samples = 128  # Adjust based on dataset size\n",
    "\n",
    "# Set checkpoint and log locations\n",
    "recipe.log.log_dir = \"/workspace/logs_03_15\"\n",
    "recipe.log.ckpt.save_optim_on_train_end = True\n",
    "\n",
    "# Configure learning rate scheduler\n",
    "recipe.optim.config.lr = 1e-5\n",
    "recipe.optim.lr_scheduler.min_lr = 1e-6\n",
    "\n",
    "# Assign dataset configuration\n",
    "recipe.data = data\n",
    "\n",
    "# Configure data blending (if needed)\n",
    "recipe.data.paths = [1, '/dli/task/03_domain_adaptive_pretraining/preprocessed_data_text_document']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "303b9f780763d641",
   "metadata": {},
   "source": [
    "After configure the training procedure properly, we can run the training by instantiate the `executor` and use `nemorun` to start the training:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c1f8b3071d8ff80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Launch the pretraining job \n",
    "executor = local_executor_torchrun(devices=recipe.trainer.devices)\n",
    "run.run(recipe, executor=executor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e2dad82-017c-49b4-abc0-017de9e5ddd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Modify checkpoint path\n",
    "!ls -lh /workspace/logs_03_15/llama2_7b_dapt/2025-03-18_05-20-49/checkpoints/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf30d8c8",
   "metadata": {},
   "source": [
    "### To monitor the training, launch Tensorboard from another terminal\n",
    "\n",
    "`tensorboard --logdir nemo_experiments --bind_all`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a57a6680",
   "metadata": {},
   "source": [
    "# Step 4: Generate Results from Llama-2-7b Model and Trained DAPT Checkpoints"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5cfef27",
   "metadata": {},
   "source": [
    "We use the `llm.generate` API in NeMo 2.0 to generate results from the trained DAPT checkpoint. Find your last saved checkpoint from your experiment dir: `/workspace/logs_01_31/llama2_7b_dapt/2025-02-27_00-43-49/checkpoints`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d800d37d",
   "metadata": {},
   "outputs": [],
   "source": [
    "### Modify checkpoint path \n",
    "dapt_ckpt_path=str(next((d for d in Path(\"/workspace/logs_03_15/llama2_7b_dapt/2025-03-18_05-20-49/checkpoints/\").iterdir() if d.is_dir() and d.name.endswith(\"-last\")), None))\n",
    "print(\"We will load DAPT checkpoint from:\", dapt_ckpt_path)\n",
    "base_ckpt_path=Path(\"/root/.cache/nemo/models/llama2-7b-hf/\")\n",
    "print(\"We will load base model checkpoint from:\", base_ckpt_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d8779c8",
   "metadata": {},
   "source": [
    "When using `llm.generate` API, you can also pass a [custom data module](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama/nemo2-sft-peft/nemo2-sft.ipynb). Here we will use a sample verilog dataset to generate predictions.  For a quick demonstration, we will use the first 100 lines as an example input.The input JSONL file should contain input and output fields (additional keys are optional). In the following example, the generated predictions are saved to the `dapt_predictions.jsonl` file. Note that while fine-tuning required a minimum of 2 GPUs with `tensor_model_parallel_size=2`, generating predictions only requires `tensor_model_parallel_size=1`. However, using multiple GPUs can speed up the inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ca90568-5e44-404a-bdb1-7ae08bcca3aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls /dli/task/03_domain_adaptive_pretraining/data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcca4ccf",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "head -n 30 /dli/task/03_domain_adaptive_pretraining/data/MG-Verilog_high_level_global_summary_in_out_test.jsonl > /dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl\n",
    "head -n 1 /dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6adfc26a-3284-46c8-badb-7f5dcc0362ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_data=\"/dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl\"\n",
    "output_path_base=\"/dli/task/03_domain_adaptive_pretraining/evals/llama2_7b_base_prediction.jsonl\"\n",
    "output_path_dapt=\"/dli/task/03_domain_adaptive_pretraining/evals/dapt_prediction.jsonl\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fc10e7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from megatron.core.inference.common_inference_params import CommonInferenceParams\n",
    "from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed\n",
    "from nemo import lightning as nl\n",
    "\n",
    "def trainer() -> run.Config[nl.Trainer]:\n",
    "    strategy = run.Config(\n",
    "        nl.MegatronStrategy,\n",
    "        tensor_model_parallel_size=2\n",
    "    )\n",
    "    trainer = run.Config(\n",
    "        nl.Trainer,\n",
    "        accelerator=\"gpu\",\n",
    "        devices=2,\n",
    "        num_nodes=1,\n",
    "        strategy=strategy,\n",
    "        plugins=bf16_mixed(),\n",
    "    )\n",
    "    return trainer\n",
    "\n",
    "# Configure inference to predict on base model checkpoint\n",
    "def configure_inference_base():\n",
    "    return run.Partial(\n",
    "        llm.generate,\n",
    "        path=str(base_ckpt_path),\n",
    "        trainer=trainer(),\n",
    "        input_dataset=input_data,\n",
    "        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),\n",
    "        output_path=output_path_base,\n",
    "    )\n",
    "\n",
    "# Configure inference to predict on trained DAPT checkpoint\n",
    "def configure_inference_dapt():\n",
    "    return run.Partial(\n",
    "        llm.generate,\n",
    "        path=str(dapt_ckpt_path),\n",
    "        trainer=trainer(),\n",
    "        input_dataset=input_data,\n",
    "        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),\n",
    "        output_path=output_path_dapt,\n",
    "    )\n",
    "\n",
    "\n",
    "def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:\n",
    "    # Env vars for jobs are configured here\n",
    "    env_vars = {\n",
    "        \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\",\n",
    "        \"NCCL_NVLS_ENABLE\": \"0\",\n",
    "    }\n",
    "\n",
    "    executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\", env_vars=env_vars)\n",
    "\n",
    "    return executor\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    run.run(configure_inference_base(), executor=local_executor_torchrun())\n",
    "    run.run(configure_inference_dapt(), executor=local_executor_torchrun())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5dab58e",
   "metadata": {},
   "source": [
    "# Step 5: Calculate Evaluation Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd4e4c16",
   "metadata": {},
   "source": [
    "We can evaluate the model's predictions by calculating the Exact Match (EM) and F1 scores.\n",
    "\n",
    "- Exact Match is a binary measure (0 or 1) checking if the model outputs match one of the ground truth answer exactly.\n",
    "- F1 score is the harmonic mean of precision and recall for the answer words.\n",
    "\n",
    "Below is a script that computes these metrics. The sample scores can be improved by training the model further and performing hyperparameter tuning. In this notebook, we only train for 20 steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd16f82c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scores from base model\n",
    "!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file /dli/task/03_domain_adaptive_pretraining/evals/llama2_7b_base_prediction.jsonl --label_field \"label\" --pred_field \"prediction\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bd94ad8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scores from DAPT model\n",
    "!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file /dli/task/03_domain_adaptive_pretraining/evals/dapt_prediction.jsonl --label_field \"label\" --pred_field \"prediction\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df338bc6-134b-4e43-bbd2-146c7601029f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
